WO 2004/095274 



PCT/IB2004/050479 




Method of and System to set an output quality of a media frame 



The invention relates to a method of setting an output quality of a next media frame,
wherein the output quality is provided by a media processing application, and the
media processing application is designed to provide a plurality of output qualities of the next
media frame.

The invention further relates to a system for setting an output quality of a next media
frame, comprising application means conceived to provide the output quality of a
plurality of output qualities of the next media frame.

The invention further relates to a computer program product designed to
perform such a method.

The invention further relates to a storage device comprising such a computer
program product.

The invention further relates to a television set comprising such a system.

An embodiment of such a method and system is disclosed in WO 2002/019095. Here, a
method of running an algorithm and a scalable programmable processing device on a system
like a VCR, a DVD-RW, a hard disk or an Internet link is described. The algorithms are
designed to process media frames, for example video frames, while providing a plurality of
quality levels of the processing. Each quality level requires an amount of resources.
Depending upon the different requirements for the different quality levels, budgets of the
available resources are assigned to the algorithms in order to provide an acceptable output
quality of the media frames. However, the content of a media stream varies over time, which
leads to different resource requirements of the media processing algorithms over time. Since
resources are finite, deadline misses are likely to occur. In order to alleviate this, the
media algorithms can run at lower than default quality levels, leading to correspondingly
lower resource demands.



It is an object of the invention to provide a method according to the opening
paragraph that sets a quality of a media frame in an improved way. In order to achieve this
object, the method comprises setting the output quality of the next media frame based upon a
self-learning control strategy that uses a processing time and an output quality of a previous
media frame to determine the output quality of the next media frame.

An embodiment of the method according to the invention is described in claim 2,
wherein the method comprises: processing the previous media frame; determining a state
comprising a relative progress value of the processed previous media frame, a scaled
budget value of the processed previous media frame, and the output quality of the processed
previous media frame; and determining a revenue based upon the state and a possible output
quality of the next media frame.

An embodiment of the method according to the invention is described in claim 3,
wherein the revenue is based upon a number of deadlines that were missed, the output
quality of the previous media frame, and a quality change.

An embodiment of the method according to the invention is described in claim 4,
wherein the revenue is determined for a finite number of states, the finite number of states
being determined by a finite set of scaled budget values and a finite set of relative progress
values.

An embodiment of the method according to the invention is described in claim 5,
comprising reducing the number of states for which the revenue is determined by reducing
those states that only differ in the output quality of the processed previous media frame.

It is an object of the invention to provide a system according to the opening
paragraph that sets an output quality of a media frame in an improved way. In order to
achieve this object, the system comprises control means conceived to set the output quality of
the next media frame based upon a self-learning control strategy that uses a processing time
and an output quality of a previous media frame to determine the output quality of the next
media frame.

Embodiments of the system according to the invention are described in claims
7 and 8.






These and other aspects of the invention will be apparent from and elucidated
with reference to the embodiments described hereinafter as illustrated by the following
Figures:

Figure 1 illustrates an agent-environment interaction in Reinforcement Learning;
Figure 2 illustrates a basic scalable video processing task;
Figure 3 illustrates the task's processing behavior by means of an example timeline;
Figure 4 illustrates the task's processing behavior by means of a further example timeline;
Figure 5 illustrates an example timeline for b = P/2;
Figure 6 illustrates a further example timeline for b = P/2;
Figure 7 shows a plane in the space of Markov policies;
Figure 8 illustrates an example state space for three quality levels;
Figure 9 illustrates the main parts of the system according to the invention in a
schematic way.

Figure 1 illustrates an agent-environment interaction in Reinforcement
Learning. Reinforcement Learning (RL) is a computational approach to goal-directed
learning from interaction, see for example R.S. Sutton and A.G. Barto, Reinforcement
Learning: An Introduction, MIT Press, Cambridge, MA, 1998. It is learning what to do, i.e. how
to map states to actions, so as to maximize a numerical revenue signal. The learner and
decision maker is called the agent. The thing it interacts with, comprising everything outside
the agent, is called the environment. The agent is not told which actions to take, but must
discover which actions yield the most revenue by trying them. An action may affect not only
the immediate revenue but also the next situation and, through that, all subsequent revenues.
These two characteristics, trial-and-error search and delayed revenue, are the two most
important distinguishing features of RL.
RL is defined not by characterizing learning methods, but by characterizing a
learning problem. Any method that is well suited to solving that problem is considered to be
an RL method. One of the challenges in RL is the trade-off between exploration and
exploitation. To obtain a lot of revenue, an RL agent must prefer actions that it has tried in
the past and found to be effective in producing revenue. But to discover such actions, it has to
try actions it has not selected before. The agent has to exploit what it already knows in order
to obtain revenue, but it also has to explore in order to make better action selections in the
future. The dilemma is that generally neither exploration nor exploitation can be pursued
exclusively without failing at the task. The agent must try a variety of actions and
progressively favor those that appear to be the best. On a stochastic task, each action must be
tried many times to gain a reliable estimate of its expected revenue.

Apart from the agent and the environment, one can identify three main
sub-elements of an RL system: a policy, a revenue function, and a value function. A policy
defines the agent's way of behaving at a given time. A policy is a mapping from states of the
environment to actions to be taken in those states. In general, policies may be stochastic. A
revenue function defines the goal in an RL problem. It maps each perceived state (or
state-action pair) of the environment to a single number, a revenue, indicating the intrinsic
desirability of that state. An RL agent's sole objective is to maximize the total revenue it
receives in the long run. Revenue functions may be stochastic. A value function specifies
what is good in the long run. The value of a state is the total amount of revenue an agent can
expect to accumulate over the future, starting from that state. Whereas revenues determine
the immediate, intrinsic desirability of environmental states, values indicate the long-term
desirability of states after taking into account the states that are likely to follow when
applying the policy, and the revenues available in those states. Values must be estimated and
re-estimated from the sequences of observations an agent makes over its entire lifetime.

The agent 100 and the environment 102 interact continually, the agent 100
selecting actions and the environment 102 responding to those actions and presenting new
situations to the agent. The environment 102 also gives rise to revenues, special numerical
values that the agent 100 tries to maximize over time. The agent 100 and the environment 102
interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \ldots$. At each time
step $t$, the agent 100 receives some representation of the environment's state,
$s_t \in S$, where $S$ is the set of environmental states, and on that basis selects an
action, $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$. One
time step later, in part as a consequence of its action, the agent 100 receives a numerical
revenue, $r_{t+1} \in \mathbb{R}$, together with a new representation of the environmental
state, $s_{t+1}$.

At each time step $t$, the agent 100 implements a mapping from states to
probabilities of selecting each possible action. This mapping is called the agent's policy and
is denoted by $\pi_t$, where $\pi_t(s,a)$ is the probability that $a_t = a$ if $s_t = s$. A
policy may also be deterministic, which means that each state is mapped to a single action.
RL methods specify how the agent 100 changes its policy as a result of its experience. The
agent's goal, roughly speaking, is to maximize the total amount of revenue it receives in the
long run.

In RL, the goal of the agent 100 is formalized in terms of a special revenue
signal passing from the environment 102 to the agent 100. At each time step $t > 0$, the
revenue is a simple number, $r_t \in \mathbb{R}$. Informally, the goal of the agent 100 is to
maximize the total amount of revenue it receives. This means maximizing not the immediate
revenue, but the cumulative revenue in the long run. If the agent 100 is expected to perform,
revenues must be provided to it in such a way that in maximizing them the agent 100 will also
achieve the goals. Therefore, the revenues must be set up such that they are in balance with
the goal.

The agent's goal is to maximize the revenues it receives in the long run. In
general, it is expected to maximize the expected return, where the return, $R_t$, is defined as
some specific function of the revenue sequence. In the simplest case, the return is the sum of
the revenues:

$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T \tag{1}$$

where $T$ is a final time step. This approach makes sense in applications where there is a
natural notion of final time step, that is, when the agent-environment interaction breaks
naturally into subsequences, which are called episodes, such as plays of a game, trips through
a maze, or any sort of repeated interaction. Each episode ends in a special state called the
terminal state, followed by a reset to a standard starting state or to a sample from a standard
distribution of starting states. Tasks with episodes of this kind are called episodic tasks.

On the other hand, in many cases the agent-environment interaction does not
break naturally into identifiable episodes, but goes on continually without limit. These are
called continuing tasks. For continuing tasks the final time step would be $T = \infty$, and
therefore the return, which is what is maximized, could itself be infinite. The additional
concept that is needed is discounting. According to this approach, the agent 100 tries to
select actions so that the sum of the discounted revenues it receives over the future is
maximized. In particular, it chooses $a_t$ to maximize the expected discounted return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{2}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate. The discount
rate determines the present value of future revenues: a revenue received $k$ time steps in the
future is worth only $\gamma^{k-1}$ times what it would be worth if it were received
immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the revenue
sequence $\{r_k\}$ is bounded. If $\gamma = 0$, the agent 100 is 'myopic' in being concerned
only with maximizing immediate revenues. As $\gamma$ approaches 1, the objective takes future
revenues into account more strongly: the agent 100 becomes more farsighted.
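
As an illustration of the discounted return of equation (2), the following minimal Python
sketch evaluates the sum for a finite revenue sequence. The function name
`discounted_return` and the example numbers are assumptions made for illustration only; they
do not appear in the original text.

```python
def discounted_return(revenues, gamma):
    """Return sum_k gamma**k * revenues[k] for a finite revenue sequence.

    revenues[0] plays the role of r_{t+1}, revenues[1] of r_{t+2}, and so on.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("the discount rate must lie in [0, 1]")
    total = 0.0
    # Work backwards through the sequence: R = r + gamma * R_next.
    for r in reversed(revenues):
        total = r + gamma * total
    return total


# With gamma = 0 only the immediate revenue counts (the 'myopic' agent);
# with gamma = 1 the result is the plain sum of equation (1).
print(discounted_return([1.0, 1.0, 1.0], 0.0))
print(discounted_return([1.0, 1.0, 1.0], 0.5))
print(discounted_return([1.0, 1.0, 1.0], 1.0))
```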

Most RL algorithms are based on estimating value functions, i.e. functions of
states (or state-action pairs) that estimate how good it is for the agent 100 to be in a given
state (or how good it is to perform a given action in a given state). The notion of 'how good'
is defined in terms of future revenues that can be expected, i.e. in terms of the expected
return. The revenues the agent 100 can expect to receive in the future depend on what actions
it will take. Accordingly, value functions are defined with respect to particular policies.

Recall that a policy, $\pi$, is a mapping from each state $s$ and action
$a \in A(s)$ to the probability $\pi(s,a)$ of taking action $a$ when in state $s$. Informally,
the value of a state $s$ under a policy $\pi$, denoted by $V^{\pi}(s)$, is the expected return
when starting in state $s$ and following $\pi$ thereafter:

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}. \tag{3}$$

Similarly, the value of taking action $a$ in state $s$ under a policy $\pi$, denoted
$Q^{\pi}(s;a)$, is defined as the expected return starting from $s$, taking action $a$, and
thereafter following policy $\pi$:

$$Q^{\pi}(s;a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}. \tag{4}$$

$Q^{\pi}$ is called the action-value function for policy $\pi$.

To select an action at a time step, given the state $s$, one method is to behave
greedily, i.e. to select the action $a$ for which $Q(s;a)$ is maximal. This method exploits
current knowledge to maximize immediate revenue, but it spends no time exploring apparently
inferior actions to see if they might really be better. A simple alternative is to behave
greedily most of the time, but every once in a while, say with a probability $\varepsilon$,
instead select an action at random, uniformly, independently of the action-value estimates.
Methods using this near-greedy action selection rule are called $\varepsilon$-greedy methods.
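
The near-greedy selection rule described above can be sketched in Python as follows. This is
a minimal sketch under stated assumptions: the function name `epsilon_greedy`, the dictionary
representation of the action values of the current state, and the example values are all
hypothetical and serve only as illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Select an action from q_values, a dict mapping action -> estimated value.

    With probability epsilon an action is drawn uniformly at random,
    independently of the estimates; otherwise the action with the largest
    estimate is returned (ties resolve to the first such action).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=q_values.get)   # exploit


# Hypothetical action values for three quality levels in some state.
estimates = {0: 1.5, 1: 4.0, 2: 2.5}
print(epsilon_greedy(estimates, epsilon=0.0))   # purely greedy selection
```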

Sarsa is a Temporal Difference (TD) learning method. TD learning methods
can learn directly from raw experience without a model of the environment's dynamics, and
they update estimates of values based in part on other learned estimates of values, without
waiting for a final outcome (they bootstrap). In Sarsa, the update rule for action values is
given by

$$Q(s_t;a_t) \leftarrow Q(s_t;a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1};a_{t+1}) - Q(s_t;a_t) \right] \tag{5}$$

where $s_t$ denotes the state at a time step $t$, $a_t$ denotes the action taken at time step
$t$, $r_{t+1}$ denotes the revenue received at the next time step, $t+1$, $s_{t+1}$ denotes the
state at the next time step, $a_{t+1}$ denotes the corresponding action to be taken, $\alpha$
denotes a step-size parameter, and $\leftarrow$ denotes the update of the left-hand value with
the right-hand value. This update is done after every transition from a state $s_t$. This rule
uses every element of the quintuple of events, $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, that
make up a transition from one state-action pair to the next. This quintuple gives rise to the
name Sarsa for the algorithm.

Below, a learning algorithm based on the Sarsa update rule is given for a
continuing task:

Algorithm SARSA
a. initialize all $Q(s;a)$ arbitrarily
b. initialize $s$
c. select the action $a$ for which $Q(s;a)$ is maximal ($\varepsilon$-greedy)
d. repeat
e.    take action $a$
f.    at the next time step, observe the resulting revenue $r'$ and the new state $s'$
g.    select the action $a'$ for which $Q(s';a')$ is maximal ($\varepsilon$-greedy)
h.    $Q(s;a) \leftarrow Q(s;a) + \alpha \left( r' + \gamma\, Q(s';a') - Q(s;a) \right)$
i.    $s \leftarrow s'$; $a \leftarrow a'$
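
The listing above can be sketched in Python as follows. This is a minimal sketch, not the
implementation of the described system: the function name `sarsa`, the callback `step(s, a)`
that returns the revenue and the next state, the zero initialization and the default parameter
values are all assumptions made for illustration. The final statement of the loop, which
advances to the next state-action pair, follows the standard Sarsa formulation.

```python
import random
from collections import defaultdict


def sarsa(step, actions, s0, n_steps, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run Sarsa for n_steps transitions of a continuing task.

    step(s, a) is a hypothetical environment callback returning
    (revenue, next_state). Returns the table of action values Q[(s, a)].
    """
    rng = random.Random(seed)
    q = defaultdict(float)                  # a. all Q(s;a) start at zero

    def select(s):                          # epsilon-greedy selection
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    s = s0                                  # b. initial state
    a = select(s)                           # c. initial action
    for _ in range(n_steps):                # d. repeat
        r, s_next = step(s, a)              # e./f. act, observe revenue and state
        a_next = select(s_next)             # g. choose the next action
        # h. Sarsa update of the action value of the pair just left
        q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])
        s, a = s_next, a_next               # advance to the next state-action pair
    return q


# Toy environment with one state: action 1 yields revenue 1, action 0 yields 0.
def toy_step(state, action):
    return (1.0 if action == 1 else 0.0), 0


q_table = sarsa(toy_step, actions=[0, 1], s0=0, n_steps=2000)
print(q_table[(0, 1)] > q_table[(0, 0)])
```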

Consumer terminals, such as set-top boxes and digital TV sets, currently apply
dedicated hardware components to process video. In the foreseeable future, programmable
hardware with video processing in software is expected to take over. A characteristic of
this so-called software video processing is its highly fluctuating, data-dependent
resource requirements.

With video processing there is usually a gap between the worst-case and
average-case decoding times. Moreover, there is a distinction between short-term (or
stochastic) and long-term (or structural) load fluctuations. Structural load fluctuations are,
amongst others, caused by the varying complexity of video scenes. Since worst-case resource
allocation is usually not acceptable, due to high pressure on cost, resource allocation
preferably has to be closer to the average case. To prevent overload, some form of load
reduction is inevitable.

Soft timing requirements for tasks with fluctuating load, such as acceptance of
occasional deadline misses or an average-case response time requirement, can be viewed as a
special case of Quality of Service (QoS), 'the collective effect of service performances that
determine the degree of satisfaction by a user of the service', see ITU-T Recommendation
E.800, Geneva, 1994. The QoS abstraction provides a means to reason about and deal with
tasks with heterogeneous soft timing requirements and heterogeneous adaptive capabilities,
such as approximate computing or job skipping, within a single system.

Resource reservation, with temporal protection, allows the overload
management problem for heterogeneous soft real-time systems to be dissected into a number of
sub-problems that can be addressed separately. In this way, overload management and semantic
(i.e. value-based) decision-making can be taken out of the scheduler. Two responsibilities
remain to be addressed: deciding which task gets which budget, and adjusting the load of each
task to its assigned budget. The first responsibility is global, and requires a unified QoS
measure. The second responsibility is local, and may use task-specific QoS adaptation.

Here, local QoS control is concerned, i.e. trying to optimize the local QoS
within the allocated budget, in the context of high-quality video processing. It is assumed
that the video processing task is scalable, i.e. that it can trade picture quality for resource
usage at the level of individual frames, and that the task works ahead, i.e. that it can start
processing the next frame immediately after completing the previous one, provided that the
data are available. These scalable video algorithms provide a limited number of QoS levels
that can be chosen for each frame. The extent to which working ahead can be applied is
determined by latency and buffer constraints. The QoS specification for high-quality video
combines three elements, which have to be balanced: processing quality, deadline misses, and
quality changes.

The balancing control strategies are concerned with two types of load
fluctuations: short-term (or stochastic) and structural. To control the short-term load
fluctuations, the control problem is modeled as a Markov Decision Process, which is a
general approach for solving discrete stochastic decision problems, see M.L. Puterman, Markov
Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and
Mathematical Statistics, Wiley-Interscience, New York, 1994. To deal with
structural load fluctuations, budget scaling is used: applying the original static or dynamic
solution for a budget that is inversely proportional to the current structural load.

Figure 2 illustrates a basic scalable video processing task. A single,
asynchronous, scalable video processing task 200 is considered, with an associated controller
202. The video processing task 200 can process frames at a (possibly small) discrete number
of quality levels. The video processing task 200 retrieves frames to be processed from an
input queue 208, and places processed frames in an output queue 210. For convenience, it is
assumed that the successive frames are numbered 1, 2, ... . An input process 204 (for example
a digital video tuner) periodically inserts frames into the input queue, with a period $P$,
and an output process 206 (for example a video renderer) consumes frames from the output
queue, with the same period $P$. Hence, it is assumed that the input and output frame rates
are the same, but they could be different too. The input process 204 and the output process
206 are synchronized with a fixed latency $\delta$, i.e., if frame $i$ enters the input queue
208 at time $e_i = e_0 + i \cdot P$, where $e_0$ is an offset, then the frame is consumed from
the output queue 210 at time $e_i + \delta$. Before processing a frame, the controller 202
selects the quality level at which the frame is processed. The processing time for a frame
depends on both the chosen quality level and the data complexity of the frame. On average, the
task has to process one frame per period $P$. By choosing the latency $\delta$ larger than
$P$, the task is given some space to even out its varying load by working ahead.

Consider a frame $i$, which enters the input queue at time $e_i$. Clearly, $e_i$ is the
earliest start time for processing the frame, and $d_i = e_i + \delta$ is the latest possible
completion time, thus the deadline. For convenience, a virtual deadline $d_0 = e_0 + \delta$
is defined. The actual start time for frame $i$, the $i$-th start point of the task, is
denoted by $s_i$. The actual completion time for frame $i$, the $i$-th milestone of the task,
is denoted by $m_i$. With a non-zero processing time for frames it holds that $m_i > e_i$. If
$m_i > d_i$, the task has missed its deadline for frame $i$. If $m_{i-1} < e_i$, assuming
$i > 1$, the task is blocked from $m_{i-1}$ until $e_i$. For $i > 1$,
$s_i \ge \max\{m_{i-1}, e_i\}$.

A work-preserving approach is assumed, which means that a frame is not
aborted if its deadline is missed, but is completed anyhow. The frame is then used for the
next deadline. Other approaches can be used too. Note that even additional deadlines of
subsequent frames may be missed before the frame is completed. If a deadline is missed, the
following actions are needed. First, the output process has to perform error concealment. For
example, a video renderer could reuse the most recently displayed frame. Such an error
concealment can reduce the perceived quality, especially in scenes with a lot of motion.
Second, the controller performs error recovery by skipping a subsequent frame, to restore the
correspondence between frame number and deadline and to avoid a pile-up in the input
queue. The frame to be skipped should be chosen carefully. For example, in MPEG
decoding, B-frames can safely be skipped, whereas skipping an I-frame can stall the stream.

Figures 3 and 4 illustrate the task's processing behavior by means of two
example timelines, in which $P = 1$, $\delta = 2$, and $s_1 = d_0 = 0$. The task has to process
5 frames. The frames actually processed are denoted in Figure 3 by reference numerals 301,
302, 304, and 305 and in Figure 4 by reference numerals 401, 402, 403, 404, and 405. In Figure
3, deadline $d_2$ is missed. The controller handles the deadline miss by using frame 302 at
deadline $d_3$ and by skipping frame 303. In Figure 4, the task becomes blocked at milestone
$m_3$, because frame 404 is not present in the input queue ($e_4 = d_2$).

Starting at $d_0$, in the period between each pair of successive deadlines the task
is assigned a guaranteed processing-time budget $b$ ($0 < b \le P$). Based on this guaranteed
budget, a measure called progress is introduced. Progress $p_i$, calculated at a start point
$s_i$, is the total amount of guaranteed budget left until $d_{i-1}$, divided by $b$. This
progress indicates how much budget is left after completing the previous frame $i-1$. Progress
is an important measure for the controller, because a larger progress leads to a lower risk of
missing the deadline for the frame to be processed. Progress is always non-negative; in case of
deadline misses, this is ensured by using the completed frame at a later deadline.
Furthermore, due to limited queue sizes there is also an upper bound $p_{\max} = \delta - 1$
on progress. Note that progress at a start point is computed based on the deadline of the
just-completed frame. The reason not to compute progress at milestones is that budget losses
due to blocking would otherwise not be accounted for in the progress. In case of blocking, the
progress used by the controller at the first-next start point would then be too high
($> p_{\max}$).

In Figures 3 and 4, it is assumed that $b = P$, which means that the task has a
private processor. In Figure 3, the progress at the successive start points is given by
$p_1 = 0$, $p_2 = 0.25$, $p_4 = 0.75$, and $p_5 = 0.75$, respectively, and in Figure 4 by
$p_1 = 0$, $p_2 = 0.5$, $p_3 = 1$, $p_4 = 1$, and $p_5 = 0.5$, respectively.
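
To make the progress measure concrete, the following minimal Python sketch simulates the task
for the case of a private processor, where the budget equals the period and budget therefore
elapses at the same rate as time. Deadline misses are handled as described: the completed
frame is used at the next deadline it can meet and the frames in between are skipped. The
function name `simulate_progress` and the per-frame processing times are assumptions; the
times were chosen so that the sketch reproduces the progress values listed for Figures 3 and
4, since the figures themselves are not reproduced here.

```python
def simulate_progress(proc_times, n_frames, period=1.0, latency=2.0):
    """Return {frame number: progress at its start point} for budget b = period.

    proc_times maps frame number -> processing time (hypothetical values).
    Times are measured from the virtual deadline d_0 = 0, so e_0 = -latency.
    """
    def entry(i):                       # e_i = e_0 + i * P
        return -latency + i * period

    def deadline(i):                    # d_i = e_i + latency
        return entry(i) + latency

    progress = {}
    clock = deadline(0)                 # the first start point coincides with d_0
    i = 1
    while i <= n_frames:
        start = max(clock, entry(i))    # blocked until frame i has arrived
        # Budget left until the deadline used by the previous frame, over b.
        progress[i] = (deadline(i - 1) - start) / period
        clock = start + proc_times[i]   # milestone of frame i
        used = i
        while clock > deadline(used):   # deadline missed: use a later deadline
            used += 1
        i = used + 1                    # frames i+1 .. used are skipped
    return progress


# Hypothetical processing times consistent with the values quoted for Figure 3.
print(simulate_progress({1: 0.75, 2: 1.5, 3: 0.5, 4: 1.0, 5: 0.5}, 5))
# Hypothetical processing times consistent with the values quoted for Figure 4.
print(simulate_progress({1: 0.5, 2: 0.5, 3: 0.75, 4: 1.5, 5: 0.5}, 5))
```

In the first run frame 2 misses its deadline, so no progress value is produced for the skipped
frame 3. In the second run the task is blocked after frame 3 until frame 4 arrives.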

Figures 5 and 6 illustrate two example timelines for $b = P/2$. The task has to
process 5 frames. The frames actually processed are denoted in Figure 5 by reference
numerals 501, 502, 504, and 505 and in Figure 6 by reference numerals 601, 602, 603, 604,
and 605. Again, it is assumed that $P = 1$, $\delta = 2$, and $d_0 = 0$. It is further assumed
that $s_1$ is the moment at which the task is assigned budget for the first time. In Figure 5,
the progress at the successive start points is given by $p_1 = 0$, $p_2 = 0.5$, $p_4 = 0.75$,
and $p_5 = 0.5$, respectively, and in Figure 6 by $p_1 = 0$, $p_2 = 0.5$, $p_3 = 0.75$,
$p_4 = 1$, and $p_5 = 0.5$, respectively. Note that in each period the budget is distributed
differently, as determined by an underlying scheduler. In Figure 6, at $m_3$ the task has
consumed half of its budget for that period. The other half of the budget is lost due to
blocking.

As mentioned before, at each start point the controller has to select the quality
level at which the upcoming frame is processed. Preferably, a control strategy is chosen that
finds an optimal balance between the following three objectives:

- because deadline misses and the accompanying frame skips result in artifacts in the output,
deadline misses should be as sparse as possible. To prevent deadline misses, it may be
necessary to process frames at lower quality levels.

- to obtain a high output quality, frames should be processed at as high a quality level as
possible.

- the number and size of quality-level changes should be as low as possible, because (bigger)
changes in the quality level may result in (more) perceptible artifacts.

To find an optimal balance, a numerical revenue is assigned to each frame that
is processed. A revenue is composed of a (possibly high) penalty on the number of deadlines
missed while the frame is being processed, a reward for processing the frame at a particular
quality level, and a penalty for processing the frame at a quality level that differs from the
one used for the preceding frame. Any control strategy that maximizes the average revenue
over a sequence of frames balances the three objectives. Moreover, the average revenue
provides a tunable QoS metric for the task.
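
The composition of the per-frame revenue can be sketched in Python as follows. The text only
states which three components make up the revenue; the function name `frame_revenue`, the
additive form, the penalty that grows linearly with the size of the quality change, and all
numerical values below are assumptions made for illustration.

```python
def frame_revenue(quality, prev_quality, deadlines_missed,
                  quality_reward, miss_penalty, change_penalty):
    """Revenue of one processed frame.

    quality_reward maps quality level -> reward; miss_penalty is charged per
    missed deadline; change_penalty is charged per step of quality change.
    """
    return (quality_reward[quality]
            - miss_penalty * deadlines_missed
            - change_penalty * abs(quality - prev_quality))


# Hypothetical tuning: three quality levels and a high deadline-miss penalty.
rewards = {0: 4.0, 1: 6.0, 2: 8.0}
print(frame_revenue(2, 2, 0, rewards, miss_penalty=100.0, change_penalty=1.0))
print(frame_revenue(0, 2, 1, rewards, miss_penalty=100.0, change_penalty=1.0))
```

Changing the three parameters shifts the balance between the objectives, which is what makes
the average revenue usable as a tunable QoS metric.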

If, for each quality level, the processing time for each frame is known in
advance, a control strategy that maximizes the average revenue can be computed. In
that case, the optimal quality levels can be computed off-line using dynamic programming,
see R.E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.

As a first step towards a run-time control strategy, the system is modeled as a
Markov Decision Process (MDP). An MDP considers a set of states, and a set of actions for
each state. At discrete moments in time, the control points, a controller observes the current
state $s$ of the system, and subsequently takes an action $a$. This action influences the
system, and, as a result, the controller observes a new state $s'$ at the next discrete moment.
This new state is not deterministically determined by the action and the previous state, but
each combination $(s,a,s')$ has a fixed known probability. A numerical revenue is associated
with each state transition $(s, s')$. The goal of the MDP is to find a decision strategy that
maximizes the average revenue over all state transitions during the lifetime of the system.

Here, the discrete moments at which the controller observes the system are the
start points $s_i$. The state includes the task's progress at that start point, $p_i$. Because
quality-level changes are penalized, the state also includes the quality level used for the
preceding frame (the previous quality level $q_{i-1}$). Hence, the state is
$(p_i, q_{i-1})$. Finally, an action is the selection of a quality level $q_i$, and the
revenues for each state change are defined according to the description above.

With a first strategy, referred to as the MDP strategy, the MDP is solved
off-line. This implies that the state transition probabilities $\Pr(s,a,s')$ are needed in
advance. Therefore, the per-frame processing times are measured for a number of representative
video sequences, at different quality levels, and these sequences are used to compute the state
transition probabilities. The MDP is then solved off-line for a particular value of the budget
$b$. This results in a (static) Markov policy, which is a set of state-action pairs, here:
$(p_i, q_{i-1}, q_i)$. During run time, at each start point, the controller decides its action
by consulting the static Markov policy, a simple table look-up.
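
The run-time table look-up can be sketched in Python as follows. The sketch assumes that the
policy was computed off-line for a finite grid of progress values, and that a measured
progress is mapped to the largest grid point not exceeding it; this rounding rule, the function
name `select_quality` and the example policy are hypothetical and not taken from the text.

```python
import bisect


def select_quality(policy, progress_grid, progress, prev_quality):
    """Look up the quality level for the state (progress, prev_quality).

    policy maps (grid progress, previous quality) -> quality level, and
    progress_grid is the sorted list of progress values it was solved for.
    """
    # Index of the largest grid point <= progress (clamped to the first point).
    idx = max(bisect.bisect_right(progress_grid, progress) - 1, 0)
    return policy[(progress_grid[idx], prev_quality)]


# Hypothetical policy for two quality levels on a three-point progress grid.
grid = [0.0, 0.5, 1.0]
example_policy = {
    (0.0, 0): 0, (0.0, 1): 0,
    (0.5, 0): 0, (0.5, 1): 1,
    (1.0, 0): 1, (1.0, 1): 1,
}
print(select_quality(example_policy, grid, progress=0.6, prev_quality=1))
```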

The MDP can also be solved at run time, by means of Reinforcement Learning
(RL), as previously described. An RL control strategy starts with no knowledge at all, and
learns optimal behavior from the experience it gains during run time. The state-action values
are applied to choose the quality level at start points. Given the state, the quality level
(= action) yielding the largest state-action value is chosen. This approach is referred to as
the RL control strategy.

As previously described, there are short-term and structural load fluctuations.
Sharp transitions between structural load values are quite exceptional. In general, the
transitions are much smoother.

The MDP and RL control strategies implicitly assume that the processing
times of successive frames are mutually independent. This is roughly the case for short-term
load fluctuations, but not for structural load fluctuations. In order to deal with the
structural load fluctuations too, the following enhancements can be applied to the MDP and RL
strategies:

- tracking the structural load during run time, by filtering out the short-term load
fluctuations, and comparing it to a reference budget;

- compensating the original MDP and RL strategies for structural load fluctuations relative to
this reference budget, not by adjusting the allocated budget, but by applying the policy
derived for an inversely proportional budget, also called the scaled budget.

These enhanced strategies are denoted by MDP* and RL*, respectively.

To track the structural load at a start point, the ratio between the actual
processing time apt of the just-completed frame and the expected processing time ept for a
frame at the applied quality level, i.e. cf = apt/ept, must be determined. The expected
processing times have been derived off-line for each quality level. This ratio is referred to
as the complication factor cf.

It is assumed that the complication factor for a frame is more or less 
independent of the applied quality level: if the frame is processed at a different quality level, 
it should give roughly the same complication factor. This assumption is needed because the
processing time can only be measured for the quality level at which the completed frame was
processed, which is not necessarily the quality level selected for subsequent frames.

The complication factors follow both the short-term and the structural load
fluctuations. To obtain a more proper measure for the structural load, the short-term load
fluctuations are preferably filtered out, to obtain a running complication factor rcf. Several
types of filters are suitable for this purpose, such as FIR, IIR, and median filters, see Digital
Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975, A.V. Oppenheim and R.W.
Schafer. For example, an IIR filter in the form of an exponential recency-weighted average with
a step-size parameter of 0.05 can be used.

The running complication factor rcf is the basis for the scaled budget. If rcf
deviates from one, it appears as if the processing budget available to the task deviates from
its actual budget b. If rcf = 1.2, a budget b = 30 ms would appear as a budget of only 25
ms. If rcf = 0.8, that same budget would appear as a budget of 37.5 ms. Therefore, the scaled
budget is defined as b/rcf. During run time, the scaled budget is computed at each start point.
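
The tracking of the structural load and the derivation of the scaled budget can be sketched as follows. This is a minimal illustration only: the function names, the neutral starting value of 1.0, and the measured processing times are assumptions made for the example; only the step size 0.05 and the relation sb = b/rcf are taken from the text.

```python
def update_rcf(rcf, apt, ept, step_size=0.05):
    """One step of an exponential recency-weighted average.

    cf = apt / ept is the complication factor of the just-completed frame;
    rcf moves a fraction step_size of the way towards it.
    """
    cf = apt / ept
    return rcf + step_size * (cf - rcf)


def scaled_budget(budget, rcf):
    """Budget as it appears to the task under the current structural load."""
    return budget / rcf


rcf = 1.0  # assumed neutral starting estimate
for apt in (24.0, 24.0, 24.0):  # hypothetical measured times; ept assumed to be 20 ms
    rcf = update_rcf(rcf, apt, ept=20.0)

print(round(rcf, 4), round(scaled_budget(30.0, rcf), 2))
```

With a small step size the running factor reacts slowly to a single heavy frame, which is what filters out the short-term fluctuations while still following a sustained change in load.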

The MDP* strategy enhances the MDP strategy in the following way. First,
the statistics needed for solving the MDP are normalized, which means that the structural
load fluctuations are filtered out. In this way the short-term load fluctuations are separated
from the structural ones. In the off-line phase, the MDP is solved for a set of selected scaled
budgets, resulting in a set of Markov policies, one for each scaled budget in the set. Next,
during run time, the new quality level at a start point is taken from the policy that
corresponds to the actual value of the scaled budget. If the required policy is not in the set,
the desired value is obtained by linear interpolation in the space of Markov policies.

Figure 7 shows a plane in the space of Markov policies, for one particular
previous quality level q_2. In this plane, a vertical line at scaled budget value 28.2 ms
corresponds to the q_2 column in the Markov policy for scaled budget 28.2 ms, which is
obtained by interpolation from the policies for scaled budgets 28.0 ms and 28.5 ms.
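
One possible reading of the interpolation between two neighbouring policies is sketched below. The text does not specify how a policy column is represented, so the sketch assumes that each column (for a fixed previous quality level) can be summarised by a sorted list of progress thresholds at which the chosen quality level steps up; the threshold values are hypothetical, and only the scaled budgets 28.0 ms, 28.2 ms and 28.5 ms come from the text.

```python
from bisect import bisect_right


def interpolate_thresholds(sb, sb_lo, thr_lo, sb_hi, thr_hi):
    """Linearly interpolate progress thresholds between two solved policies."""
    w = (sb - sb_lo) / (sb_hi - sb_lo)
    return [(1.0 - w) * lo + w * hi for lo, hi in zip(thr_lo, thr_hi)]


def quality_for_progress(thresholds, progress):
    """Quality level index = number of thresholds at or below the progress."""
    return bisect_right(thresholds, progress)


# Hypothetical q_2 columns of the policies solved off-line (assumed values).
thr_280 = [0.40, 1.00]  # scaled budget 28.0 ms
thr_285 = [0.30, 0.90]  # scaled budget 28.5 ms

thr_282 = interpolate_thresholds(28.2, 28.0, thr_280, 28.5, thr_285)
print(thr_282, quality_for_progress(thr_282, 0.5))
```

Other representations of a policy would lead to other interpolation schemes; the threshold form is used here only because it makes a discrete policy interpolable.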

In the RL* approach the scaled budget is directly added to the state, i.e., the 
scaled budget becomes a third dimension of the state space. At a start point, given the state 
(= scaled budget, progress, previous quality level), the quality level (= action) yielding the
largest state-action value is preferably chosen. 

Within the RL* approach, the agent 100 as previously described with 
reference to Fig. 1, is a controller, selecting the quality level at which frames are processed. 
The environment 102 is given by the scalable video processing task. The discrete time steps 
at which the agent interacts with its environment are the start points. The task's state at a start 
point is defined by the combination of the scaled budget (sb), the progress (p), and the 
previous quality level (pq). An action is the choice of a quality level q at which a frame is 
processed. For states s = (sb,p,pq) and actions q, the agent 100 keeps track of action-values 
Q(s;q).

After processing a frame, at the start point of the subsequent frame to be 
processed, the agent first updates the scaled budget using the processing time of the just- 
completed frame. This updated scaled budget is part of the state at the start point. Next, the 
agent computes the revenue for the just-completed frame. For notational convenience, it is 
assumed that the just-completed frame was processed at quality level q and that the frame 
before that was processed at quality level pq. The revenue is composed of a (high) 
negatively-valued penalty on the number of deadlines that were missed since the previous 
start point, a positively-valued reward for the quality level q at which the frame was 
processed, and a negatively-valued quality-change penalty qcp(pq, q) for changing the
quality level from pq to q. Note that the agent computes the revenue based on information
provided by the environment (number of deadline misses, quality levels), instead of receiving 
the revenue directly from the environment. Using the revenue, the agent updates (learns) its
action-values. After that, the updated action-values are used to select the quality level for the
next frame to be processed, i.e. the frame that corresponds to the start point. 
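
The composition of the revenue can be illustrated with a short sketch. All numeric values below (the deadline penalty, the reward per quality level, and the shape of the quality-change penalty) are assumptions chosen for the example; the text only fixes the signs of the three components.

```python
DEADLINE_PENALTY = -1000.0          # assumed: large negative value per missed deadline
REWARD = {0: 4.0, 1: 6.0, 2: 10.0}  # assumed: positive reward per quality level


def qcp(pq, q):
    """Assumed quality-change penalty: zero if unchanged, negative otherwise."""
    return -2.0 * abs(q - pq)


def revenue(deadline_misses, pq, q):
    """Revenue for the just-completed frame, processed at q after a frame at pq."""
    return DEADLINE_PENALTY * deadline_misses + REWARD[q] + qcp(pq, q)


print(revenue(0, 2, 2))  # no miss, quality unchanged
print(revenue(1, 2, 1))  # one miss, quality lowered by one level
```

Because the agent computes this value itself from the number of deadline misses and the two quality levels, the environment only has to report those observations.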

Within the computations needed, only a finite number of states can be considered,
while both the scaled budget and the progress are continuous variables. To address this, a
finite set of scaled budget values SB = {sb_1, ..., sb_n} and a finite set of progress values
R = {p_1, ..., p_m} are defined. Action-values Q(s;q) then need to be kept only for gridpoint
states s, i.e. states s = (sb, p, pq) for which sb ∈ SB and p ∈ R. To approximate the
action-value for a non-gridpoint state, linear interpolation on the action-values of the
surrounding gridpoint states is applied.

Figure 8 illustrates an example state space for three quality levels, q_0 to q_2. In
this state space, the scaled budget points are 10 ms, 20 ms, 30 ms, and 40 ms, and the progress
points are 0.25, 0.75, 1.25, and 1.75. To approximate the action-value for a state with a scaled
budget of 25 ms, a progress of 1, and a previous quality level q_0, linear interpolation is
applied on the action-values of the four surrounding gridpoint states, as indicated in Fig. 8.
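
The interpolation over the four surrounding gridpoint states can be sketched as a bilinear interpolation. The grid points are those of the Figure 8 example; the table layout (a dictionary keyed by grid indices, previous quality and action), the function names, and the clamping of values outside the grid are assumptions made for this illustration.

```python
from bisect import bisect_right

SB_POINTS = [10.0, 20.0, 30.0, 40.0]  # scaled budget grid (ms)
P_POINTS = [0.25, 0.75, 1.25, 1.75]   # progress grid


def bracket(points, x):
    """Return index i of the lower surrounding grid point and the weight of point i+1.

    Values outside the grid are clamped to its edges (an assumption).
    """
    x = min(max(x, points[0]), points[-1])
    i = min(bisect_right(points, x) - 1, len(points) - 2)
    w = (x - points[i]) / (points[i + 1] - points[i])
    return i, w


def interpolate_q(table, sb, p, pq, q):
    """Bilinear interpolation of the action-value for a non-gridpoint state."""
    i, wi = bracket(SB_POINTS, sb)
    j, wj = bracket(P_POINTS, p)
    return ((1 - wi) * (1 - wj) * table[(i, j, pq, q)]
            + wi * (1 - wj) * table[(i + 1, j, pq, q)]
            + (1 - wi) * wj * table[(i, j + 1, pq, q)]
            + wi * wj * table[(i + 1, j + 1, pq, q)])


# Hypothetical action-values for previous quality 0 and action 0.
table = {(i, j, 0, 0): 0.0 for i in range(4) for j in range(4)}
table[(1, 1, 0, 0)] = 1.0  # (20 ms, 0.75)
table[(2, 1, 0, 0)] = 2.0  # (30 ms, 0.75)
table[(1, 2, 0, 0)] = 3.0  # (20 ms, 1.25)
table[(2, 2, 0, 0)] = 4.0  # (30 ms, 1.25)

print(interpolate_q(table, 25.0, 1.0, 0, 0))
```

For the state of the example (25 ms, progress 1) the four surrounding gridpoints are equally distant, so each contributes with weight 0.25.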

In each iteration of the Sarsa algorithm, normally one action-value is learned
(updated). As a result, learning can take a long time, and there can be a need for exploring
actions (which is often not optimal). With the current invention, in each iteration (at each
start point) the action-values for all grid point states are updated, which speeds up learning.
Moreover, there is no longer a need for exploring actions, which means that what has been
learned can be exploited better. At a start point, the processing time pt of the just-completed
frame is determined. This frame was processed at a particular quality level, q. To estimate
the processing time for the frame at a different quality level, the off-line determined ept-values
(expected processing times) are used that were also used for budget scaling. For
example, if a frame, processed at quality level q_2, yields a processing time of 20 ms, and if
ept(q_0) = 15 ms and ept(q_2) = 22 ms, then the estimated processing time for the frame at quality
level q_0 is 20 ms · ept(q_0)/ept(q_2) = 13.6 ms. The estimated processing times are used to simulate
processing the frame. Starting at a grid point state s_t, and taking a particular quality-level
action q_t, using the estimated processing time for quality level q_t, the resulting (non-grid
point) state s_{t+1} after processing the frame, the corresponding greedy quality-level action q_{t+1},
and the resulting revenue r_{t+1} can be computed. In this computation, the processing time
is first corrected for budget scaling (normalization step). Using this information, the Sarsa update
rule is applied. At each start point, this is done preferably for all grid point states and all
quality-level actions. Consequently, there is preferably no need to take a random (non-greedy)
action every now and then.

The invention can be implemented by the following
algorithms, wherein: sbp denotes the point wherein the scaled budget is calculated, i.e. the
scaled budget point; rpp denotes the point wherein the relative progress is calculated, i.e. the
relative progress point; and pq denotes the previous quality.

Algorithm initialize

1a. initialize the running complication factor
        rcf <- 1
1b. for all states (sbp, rpp, pq)
1c.     for all quality actions q
1d.         initialize the (state,action)-value
                Q(sbp, rpp, pq; q) <- 0



Algorithm get decision quality
Input: relative progress rp
Input: previously used quality pq
Output: decision quality dq

2a. compute the scaled budget
        sb <- b / rcf
2b. for scaled budget sb, relative progress rp, and previous quality pq,
    compute the interpolated (state,action)-values Q_ivec(sb, rp, pq; q) for all
    possible quality actions q
2c. decision quality dq is the quality action q that corresponds to the
    highest value Q_ivec(sb, rp, pq; q)

Algorithm update (state,action)-values
Input: processing time pt
Input: processing quality q

3a. make a copy of the running complication factor corresponding to the
    situation that existed before processing the last unit of work
        oldrcf <- rcf
3b. use pt and q to update the running complication factor
        rcf <- rcf + α · (pt / avg(q) - rcf)
3c. compute the scaled budget
        sb <- b / rcf
3d. for all states (sbp, rpp, pq)
3e.     for all quality actions q̂
3f.         estimate the processing time of the last unit of work
            for quality q̂
                ept <- (avg(q̂) / avg(q)) · pt
3g.         simulate processing the last unit of work in quality
            q̂, starting in state (sbp, rpp, pq), and having a
            normalized processing time ept / oldrcf
3h.         observe both the resulting revenue rev and the
            resulting relative progress rp
3i.         for scaled budget sb (derived in 3c), relative
            progress rp, and previous quality q̂, compute the
            interpolated (state,action)-values Q_ivec(sb, rp, q̂; q')
            for all possible quality actions q'
3j.         Q' is the highest value Q_ivec(sb, rp, q̂; q')
3k.         update the (state,action)-value Q(sbp, rpp, pq; q̂)
            using rev and Q'
                Q(sbp, rpp, pq; q̂) <-
                    Q(sbp, rpp, pq; q̂) +
                    β · (rev + γ · Q' - Q(sbp, rpp, pq; q̂))
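
Three building blocks of the algorithms above are sketched below: the estimate of the processing time at another quality level, the greedy choice of a quality action, and the Sarsa update rule for a single action-value. The sketch is illustrative only; the function names and the values of the learning rate and discount factor are assumptions, and the simulation of processing a unit of work is not shown, since it depends on definitions given elsewhere in the description.

```python
EPT = {0: 15.0, 2: 22.0}  # expected processing times (ms) from the example in the text


def estimate_processing_time(pt, q_done, q_other, ept):
    """Rescale the measured time pt at quality q_done to quality q_other."""
    return pt * ept[q_other] / ept[q_done]


def greedy_quality(values):
    """Return the quality action with the highest interpolated action-value."""
    return max(values, key=values.get)


def sarsa_update(q_old, rev, q_next, learning_rate=0.1, discount=0.95):
    """Move q_old towards the target rev + discount * q_next."""
    return q_old + learning_rate * (rev + discount * q_next - q_old)


est = estimate_processing_time(20.0, q_done=2, q_other=0, ept=EPT)
print(round(est, 1))

best = greedy_quality({0: 1.0, 1: 3.0, 2: 2.0})  # hypothetical interpolated values
print(best)

print(sarsa_update(q_old=2.0, rev=4.0, q_next=10.0, learning_rate=0.5, discount=0.9))
```

In the update algorithm these pieces would be applied inside the loops over all gridpoint states and all quality actions, with q_next being the highest interpolated action-value in the simulated successor state.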



To reduce the number of states in computations, the following technique may
be applied. Let s_x = (sb, p, pq_x) and s_y = (sb, p, pq_y) be gridpoint states that only differ in
the previous quality level, pq_x and pq_y respectively. The processing time for a frame is
independent of the quality level applied for the preceding frame. Therefore, at a start point, if
quality level q is chosen in either state s_x or s_y, the resulting state at the next start point is the
same. In terms of action-values, this means that
Q(s_x; q) - qcp(pq_x, q) = Q(s_y; q) - qcp(pq_y, q). This observation can be used as follows to
reduce the number of states in computations. To learn action-values, two-dimensional
gridpoint states are used, i.e., all combinations of a scaled budget from set SB and a progress
from set R. To obtain the action-value Q((sb, p, pq); q) for choosing quality level q in a
3-dimensional gridpoint state (sb, p, pq), a penalty qcp(pq, q) is added to the learned action-value
Q'((sb, p); q). In other words, Q((sb, p, pq); q) = Q'((sb, p); q) + qcp(pq, q), and
action-value Q' is learned. In this way, the number of states to be updated is reduced by a
factor |Q|, where Q is the set of quality levels.
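
The reduction can be illustrated by storing only the two-dimensional action-values and adding the quality-change penalty on demand. The grid sizes follow the Figure 8 example (four scaled budget points, four progress points, three quality levels); the penalty function and the stored value are hypothetical.

```python
def qcp(pq, q):
    """Assumed quality-change penalty: zero if unchanged, negative otherwise."""
    return -2.0 * abs(q - pq)


def full_action_value(q2d, sb_idx, p_idx, pq, q):
    """Recover Q((sb, p, pq); q) from the learned Q'((sb, p); q)."""
    return q2d[(sb_idx, p_idx, q)] + qcp(pq, q)


n_sb, n_p, n_q = 4, 4, 3
q2d = {(i, j, q): 0.0 for i in range(n_sb) for j in range(n_p) for q in range(n_q)}
q2d[(1, 2, 2)] = 5.0  # hypothetical learned value

states_3d = n_sb * n_p * n_q  # gridpoint states including the previous quality
states_2d = n_sb * n_p        # gridpoint states actually updated

print(full_action_value(q2d, 1, 2, pq=0, q=2), states_3d // states_2d)
```

The previous quality level then only enters when a decision is taken, not when the action-values are learned.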

The order of steps in the described embodiments of the method of the current
invention is not mandatory; a person skilled in the art may change the order of steps or
perform steps concurrently using threading models, multi-processor systems or multiple
processes without departing from the concept as intended by the current invention.

Figure 9 illustrates the main parts of the system according to the invention in a 
schematic way. The system 900 comprises a microprocessor 914, a software bus 912 and a 
memory 916. The memory 916 can be a random access memory (RAM). The memory 916 
communicates with the microprocessor 914 through software bus 912. The memory 916 
comprises computer readable code 902, 904, 906, 908, 910, and 912. The computer readable 
code 902 is designed to provide the output quality of a plurality of output qualities of the next 
media frame. The computer readable code 904 is designed to set the output quality of the 
next media frame based upon a self-learning control strategy that uses a processing time and 
an output quality of a previous media frame to determine the output quality of the next 
media frame. The computer readable code 906 is designed to process the previous media-frame.
The computer readable code 908 is designed to determine a state comprising a
relative progress value of the processed previous media-frame; a scaled budget value of the 
processed previous media-frame; and the output quality of the processed previous media- 
frame. The computer readable code 910 is designed to determine a revenue based upon the 
state and a possible output quality of the next media-frame. The computer readable code 912 
is designed to reduce the number of states for which the revenue is determined by reducing 
those states that only differ in the output quality of the processed previous media-frame. The 
system can be comprised within a television set. Furthermore, the computer readable code
can be read from a computer readable medium such as a CD or DVD. 



It should be noted that the above-mentioned embodiments illustrate rather than
limit the invention, and that those skilled in the art will be able to design many alternative
embodiments without departing from the scope of the appended claims. In the claims, any
reference signs placed between parentheses shall not be construed as limiting the claim. The
word "comprising" does not exclude the presence of elements or steps other than those listed
in a claim. The word "a" or "an" preceding an element does not exclude the presence of a
plurality of such elements. The invention can be implemented by means of hardware
comprising several distinct elements, and by means of a suitably programmed computer. In
the system claims enumerating several means, several of these means can be embodied by
one and the same item of computer readable software or hardware. The mere fact that certain
measures are recited in mutually different dependent claims does not indicate that a
combination of these measures cannot be used to advantage.



